"Visit with us" Tourism Customer Classifier

Background and Context

You are a Data Scientist for a tourism company named "Visit with us". The Policy Maker of the company wants to enable and establish a viable business model to expand the customer base.

A viable business model is a central concept that helps you understand the existing ways of doing business and how to change them for the benefit of the tourism sector.

One of the ways to expand the customer base is to introduce a new offering of packages.

Currently, there are 5 types of packages the company is offering - Basic, Standard, Deluxe, Super Deluxe, King.

Looking at the data of the last year, we observed that 18% of the customers purchased the packages.

However, the marketing cost was quite high because customers were contacted at random without looking at the available information.

The company is now planning to launch a new product i.e. Wellness Tourism Package. Wellness Tourism is defined as Travel that allows the traveler to maintain, enhance or kick-start a healthy lifestyle, and support or increase one's sense of well-being.

However, this time the company wants to harness the available data of existing and potential customers to make the marketing expenditure more efficient.

You, as a Data Scientist at the "Visit with us" travel company, have to analyze the customers' data to provide recommendations to the Policy Maker and Marketing Team, and to build a model that predicts which potential customers will purchase the newly introduced travel package.

Objective

  1. To predict whether a customer will purchase the newly introduced travel package
  2. To identify which variables are most significant
  3. To understand customer segmentation for targeting

Data Dictionary

Business Problem & Objective Analysis

Purpose:

Growing the customer base while keeping marketing costs efficient, and predicting which consumers are likely to purchase the product when targeted with a pitch. Currently, with conversion rates at 18%, the marketing department needs to identify which customers have a higher probability of purchasing the product while keeping expenditure to a minimum.

Objective:

Give the marketing department and policymakers information about which data features are the most significant and which segment of customers should be targeted more often. With given data features, the marketing department could also use a model which can accurately predict whether a customer will buy a product.

Focus:

The travel company wants to ensure that more people accept the product while also growing the customer base. The model should aim to reduce the number of customers falsely labeled as not taking on the product while also correctly labeling those who do. Thus, more analysis on which model metric should be emphasized will be done right before modeling.

Data Dictionary Analysis:

  1. Customer detail information describes the Age, Gender, Marital Status, Occupation, Designation, and Monthly Income
  2. There is also information about how developed their city is
  3. Information about owning a car and having a passport
  4. Information about trips taken, preferred hotel ratings, number of people on trip and number of children under 5
  5. The customer ID will be irrelevant to predicting purchase.
  6. There is data about employee interactions with the customer relating to the Pitch: product pitched, duration of pitch, pitch satisfaction score, and number of followups

Let's start analyzing!

Library and Data Import

Importing the libraries

Importing Data and First Glance

Data Dictionary Dataframe defined

Tourism Dataframe Defined

Reordering columns so that they are in order according to Customer Details and Customer Interaction data

Dataframe Analysis and Preprocessing

A general overview of the dataset

  1. The shape of the dataset
  2. Checking the datatypes, null values, and the number of unique values

Observations:

  • CustomerID can be dropped because it won't help with predictive modeling
  • Need to fix Gender data as it has a typo
    • 'Fe Male' replaced by 'Female'
  • Null values included in
    • Age, MonthlyIncome, NumberOfTrips, PreferredPropertyStar, NumberOfChildrenVisiting, DurationOfPitch, and NumberOfFollowups
  • Need to fix categorical variables datatypes
    • Categories: Gender, MaritalStatus, TypeofContact, Occupation, Designation, CityTier, OwnCar, Passport, PreferredPropertyStar, NumberOfPersonVisiting, NumberOfChildrenVisiting, ProductPitched, PitchSatisfactionScore, ProdTaken
      • Ordinal - Designation, CityTier, PreferredPropertyStar, NumberOfPersonVisiting, NumberOfChildrenVisiting, PitchSatisfactionScore
      • Nominal - Marital Status, TypeofContact, Occupation, ProductPitched
      • Binary - Gender, OwnCar, Passport, ProdTaken

Data Cleaning

Dropping the ID column as it won't be beneficial to the model

Fixing Genders data (Fe Male to Female)
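The replacement can be sketched with pandas on a tiny illustrative frame (the column name matches the dataset; the values are made up):

```python
import pandas as pd

# Illustrative frame containing the typo seen in the real data
df = pd.DataFrame({"Gender": ["Male", "Fe Male", "Female", "Fe Male"]})

# Replace the typo 'Fe Male' with 'Female'
df["Gender"] = df["Gender"].replace("Fe Male", "Female")
```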

Applying proper datatype formatting to categories

Defining variables for easy reference to these groups of datatype columns

Applying Category Datatype
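A minimal sketch of the dtype conversion, assuming the designation order Executive < Manager < Senior Manager < AVP < VP (the notebook's own ordering may differ):

```python
import pandas as pd

# Tiny illustrative frame; the real notebook applies this to the full dataset
df = pd.DataFrame({
    "Designation": ["Executive", "Manager", "AVP"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
    "OwnCar": [1, 0, 1],
})

# Ordinal categories carry an explicit order; nominal and binary ones do not
designation_order = ["Executive", "Manager", "Senior Manager", "AVP", "VP"]  # assumed order
df["Designation"] = pd.Categorical(df["Designation"], categories=designation_order, ordered=True)

for col in ["MaritalStatus", "OwnCar"]:
    df[col] = df[col].astype("category")
```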

Before Missing Value Treatment, Let's Look at a Statistical Analysis

Statistical Description of the Data

Extreme Value Analysis

MonthlyIncome, NumberOfTrips, DurationOfPitch

High Value in MonthlyIncome

The monthly incomes of these 2 customers are significantly higher than everyone else's. Neither of these customers bought a product, so their data is less important, since 82% of the customers in this dataset didn't take the pitched product.

The monthly income average for Executives is 19,939. Because of this, I am choosing to drop these two rows of data that have values over 90000.
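The row drop can be sketched as (toy values standing in for the real frame):

```python
import pandas as pd

# Illustrative frame: two ordinary incomes and the two extreme outliers
df = pd.DataFrame({
    "MonthlyIncome": [19000.0, 21500.0, 95000.0, 98000.0],
    "ProdTaken": [1, 0, 0, 0],
})

# Drop the extreme-income rows (both outliers sit above 90,000 and neither bought)
df = df[df["MonthlyIncome"] <= 90000].reset_index(drop=True)
```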

High Value in NumberOfTrips

While the four highest numbers of trips in a year (19-22 trips) are much higher than the average, this is a real-world possibility. We will leave these values as-is for now.

High Value in DurationOfPitch

Assumed human error in input, since the closest values to 126 and 127 are 36. I replaced the larger numbers with the same values minus the extra leading digit.
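A sketch of the replacement, assuming 126 and 127 become 26 and 27 once the spurious leading digit is dropped:

```python
import pandas as pd

# Illustrative values, including the two suspected data-entry errors
s = pd.Series([9.0, 36.0, 126.0, 127.0], name="DurationOfPitch")

# Treat 126/127 as typos: strip the extra leading '1'
s = s.replace({126.0: 26.0, 127.0: 27.0})
```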

Missing Values Treatment

From earlier analysis, we know these columns have missing values:

  • Age - 4.62% missing
  • TypeofContact - 0.51% missing
  • MonthlyIncome - 4.77% missing
  • NumberOfTrips - 2.86% missing
  • PreferredPropertyStar - 0.53% missing
  • NumberOfChildrenVisiting - 1.35% missing
  • DurationOfPitch - 5.14% missing
  • NumberOfFollowups - 0.92% missing

Age

Imputed missing Age values with the average age grouped by whether they took the product and which product was pitched. Rounded, because age shouldn't have decimals.
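The group-wise mean imputation can be sketched as (tiny illustrative frame using the dataset's column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ProdTaken": [1, 1, 0, 0, 1],
    "ProductPitched": ["Basic", "Basic", "Deluxe", "Deluxe", "Basic"],
    "Age": [30.0, np.nan, 45.0, np.nan, 32.0],
})

# Fill Age with the mean of its (ProdTaken, ProductPitched) group, then round
df["Age"] = df.groupby(["ProdTaken", "ProductPitched"])["Age"].transform(
    lambda g: g.fillna(g.mean())
)
df["Age"] = df["Age"].round().astype(int)
```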

TypeofContact

Imputed missing TypeofContact values by the most common type of contact grouped by their occupation and designation in their organization.
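The group-wise mode imputation follows the same transform pattern (illustrative values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Occupation": ["Salaried", "Salaried", "Salaried", "Free Lancer"],
    "Designation": ["Executive", "Executive", "Executive", "Manager"],
    "TypeofContact": ["Self Enquiry", "Self Enquiry", np.nan, "Company Invited"],
})

# Fill with the most common contact type within each (Occupation, Designation) group
df["TypeofContact"] = df.groupby(["Occupation", "Designation"])["TypeofContact"].transform(
    lambda g: g.fillna(g.mode().iloc[0]) if not g.mode().empty else g
)
```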

Monthly Income

Imputed the MonthlyIncome missing values with the average of MonthlyIncome grouped by the person's occupation and designation. This returns an average drawn from customers with similar incomes, so it is a reasonable imputation.

NumberOfTrips

Imputed the NumberOfTrips missing values with the average number of trips grouped by the individual's occupation. Freelancers and others may have more time to take trips, which makes occupation a good grouping variable.

PreferredPropertyStar

Imputed the PreferredPropertyStar missing values with the most common preferred property star rating grouped by which product tier was pitched. This will return reasonable values for the imputations.

NumberOfChildrenVisiting

Two options are presented here:

  1. Impute the NumberOfChildrenVisiting missing values with the average number of children under 5 grouped by marital status
  2. Replace missing values with 0, indicating that a missing value means they have 0 kids under the age of 5

I chose the second option assuming that a missing value meant there are no kids visiting under the age of 5
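Option 2 is a one-liner:

```python
import numpy as np
import pandas as pd

# Illustrative values; missing means no children under 5 are visiting
s = pd.Series([1.0, np.nan, 2.0, np.nan], name="NumberOfChildrenVisiting")

# Fill missing entries with 0 and cast back to an integer count
s = s.fillna(0).astype(int)
```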

DurationOfPitch

We can't say there was no pitch, because customers with missing pitch duration still have a product pitched, followups, and a PitchSatisfactionScore.

There were 2 options for imputations here:

  1. Use the smallest seen duration of 5 minutes for all missing values. Replacing by 0s was not an option because there was a Pitch Satisfaction score meaning there had to have been some pitch
  2. Impute the average pitch duration grouped by what product it is and the pitch satisfaction score

I chose the second option because, grouped by these two features, we will see a better fit for the missing values.

NumberOfFollowups

Imputed the NumberOfFollowups missing values by the average of the number of followups grouped by whether the product was taken and what product was pitched. This will return reasonable values for the imputations.

Data Analysis after Cleaning

Automatic Data Profiling and Reports

Info on Dataframe again (Just for information)

Correlations and Associations

With the Interactive PandasProfiling, we can see the Phik Correlations for all categorical and continuous columns

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution

These will have more analysis done throughout the workbook.

Customer Characteristics by Product/Package

Simple customer characteristics based on packages. There will be more information on these variables for the Policy Maker and Marketing as we go through data visualization and predictive modeling.

Most customers who take a product are 31 years old (average 34), male, married, self-enquired, salaried executives with an average monthly income of 22,100, taking about 3 trips a year. The pitch duration is about 16 minutes, with a higher average number of followups.

We can see the information grouped by the product that was pitched and the most common value seen. Older customers tend to get pitched higher-tier packages. The monthly income of those pitched King and Super Deluxe packages is, on average, over 33,000.

We can see the duration of the pitch is about 15-16 minutes for most packages but down to 12 minutes for King.

Grouping by both of these will complete our brief overview of the 1) customers associated with each package, 2) customers associated with each package they bought

As the packages get higher in tier (Basic → Standard → Deluxe → Super Deluxe → King), we mainly see an increase in age and monthly income.

There is an interesting difference in that Deluxe packages are bought by younger people with a lower monthly salary than Standard, even though Standard is a tier below Deluxe.

The duration of the pitch increases, and we see the highest pitch times among those who bought a Standard or Deluxe package. The exception is the King package, which has a low pitch duration among buyers, meaning it is easier to convince older individuals (over 48, with an average income of 34,000) to buy the King package.

More customer segmentation information in the next block!

Category Distribution with Respect to ProdTaken

Data Visualization

Defined Functions

Univariate

Categorical Variables

Observation and Insights

Continuous Variables

Observations and Insights:

Bivariate & Multivariate Analysis

ProdTaken against Continuous column

This shows a slightly lower distribution of age and monthly income among those who have taken a product. The duration of the pitch and the number of followups also contribute.

ProdTaken Against Continuous Columns with regards to ProductPitched

Observations and Insights

Distribution Density of each Continuous column depending if they purchased package

Observations and Insights

ProdTaken against Categorical Columns

This data represents the data distribution amongst categorical columns. This is a visualization of analysis that was done earlier. Let's look at this data with regards to the ProductPitched

ProdTaken Against Categorical Columns with regards to ProductPitched

Observations and Insights

  • Between males and females, the Basic package is the most likely to be purchased

ProdTaken with regards to ProductPitched and Duration of Pitch

This shows us the Duration of Pitch for customers who took the product and which product it was

These graphs show the Duration of Pitch for customers: how much time is given to them and how likely they are to take the product afterwards. It is interesting to see that the King package is usually not pitched for very long. Many relationships seen here have been noted in comments so the Policy Maker and marketing team can have a greater understanding of the data.

ProdTaken with regards to ProductPitched and MonthlyIncome

This shows us the Monthly Income for customers who took the product and which product it was

Only customers with a high monthly income were pitched King packages. We see that many customers with an income over 25,000 are likely to purchase a Basic or Deluxe plan. Divorced customers have relatively higher monthly incomes and are advertised all packages, but don't purchase them very often regardless of income.

Pairplot Overview

There are no notable significances, so let's add information about whether they took the product.

Much of this information has been noted already, but we see that younger individuals with average monthly incomes buy packages more often than older customers. The duration of pitch also seems to be lower for those who bought a package as the monthly income rises.

We can see that the Basic product is pitched to younger customers with a lower average monthly income. The Deluxe package is often pitched to those with average monthly incomes. The duration of the pitch stays distributed throughout. King packages are pitched to those who are older and make more money.

Data Preprocessing / Feature Engineering

  1. Treating outliers
  2. Feature Engineering
  3. Proper datatype conversion
  4. Investigating which features to remove to avoid multicollinearity!

Outliers

This data was highlighted in the data visualization section. Let's look at it again:

Observation:

Feature Engineering:

  1. Restructuring columns according to tiers
  2. Changing NumberOfChildrenVisiting to a binary categorical variable asking if they are bringing a child or not

Restructuring Categorical Data in order of significance

NumberOfChildrenVisiting Feature to HasChild
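The conversion can be sketched as:

```python
import pandas as pd

# Illustrative counts of children under 5 per booking
df = pd.DataFrame({"NumberOfChildrenVisiting": [0, 2, 1, 0]})

# Binary flag: is at least one child under 5 coming along?
df["HasChild"] = (df["NumberOfChildrenVisiting"] > 0).astype(int)
df = df.drop(columns=["NumberOfChildrenVisiting"])
```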

Final Overview of Data & Prep for Model

  1. We will be looking at the dataframe information again
  2. Insert dummy variables where needed while ensuring to drop the first dummy variable to avoid the trap
  3. Create X and y dataframes with independent and dependent (target) variables
  4. Split data for training/testing at .7/.3 respectively
  5. Checking VIF scores
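Steps 2-4 above can be sketched on a toy frame (only one nominal column shown; the real notebook encodes all of them):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the cleaned tourism data (names match the dataset)
df = pd.DataFrame({
    "ProdTaken": [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
    "MaritalStatus": ["Single", "Married", "Divorced", "Single", "Married",
                      "Single", "Divorced", "Married", "Single", "Married"],
    "Age": [28, 34, 45, 30, 38, 26, 50, 41, 29, 36],
})

# One-hot encode nominal columns, dropping the first level to avoid the dummy trap
df = pd.get_dummies(df, columns=["MaritalStatus"], drop_first=True)

X = df.drop(columns=["ProdTaken"])
y = df["ProdTaken"]

# 70/30 split, stratified so both sets keep the same positive rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
```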

Quick look at dataframe description

Creating dummy variables

Checking Product Taken distribution amongst this dataset

Splitting into X and y datasets for modeling

Splitting the datasets for train/test

Distribution Checking

Multicollinearity Check
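A from-scratch sketch of the VIF computation (the notebook itself may use statsmodels' variance_inflation_factor; vif here is a hypothetical helper):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (plus an intercept).
    """
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Uncorrelated columns give VIFs near 1; near-collinear columns blow up
print(vif(np.array([[1.0, 1], [2, -1], [3, -1], [4, 1]])))
```

A common rule of thumb is to investigate features whose VIF exceeds 5 (or 10, depending on the source).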

Model Defined Functions (Metric Scores, Confusion Matrix, and more)

Modeling Begins

Modeling Table of Contents:

  1. Metric Insights
  2. Bagging
    • Random Forest
    • Decision Tree
    • Bagging Classifier
  3. Boosting
    • AdaBoost
    • Gradient Boost
    • XGBoost
  4. Classifiers Comparison
  5. Stacking Classifier
  6. Final Classifier Analysis

For Each Classifier:

Preferred Metric Insights:

Note: Assigning a class weight is important because of the imbalanced dataset.
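One way to sketch the weighting with scikit-learn ('balanced' is one option; the notebook may set the weights differently):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels mirroring the ~18% conversion rate seen in the data
y = np.array([0] * 82 + [1] * 18)

# 'balanced' weights are inversely proportional to class frequency:
# n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same option can be passed straight to a classifier, e.g.
# DecisionTreeClassifier(class_weight="balanced")
```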


What does a tourism company want?

Which loss is greater?

  1. Predicting that customer will purchase the package but customer does not- wasting marketing expenditure.
  2. Losing out on new customers and growth by missing opportunity.

F1 Score is the best measure to use if we need to seek a balance between Precision and Recall AND there is an uneven class distribution (large number of Actual Negatives).

Since we want to grow the customer base while also keeping marketing expenditure efficient, we should use the F1 score as the metric of model evaluation.

In this case, not being able to identify a potential customer is similar to wasting marketing expenditure in terms of loss we can face. Hence, F1 score is the right metric to check the performance of the model.
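The relationship between the three metrics, shown on made-up predictions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels and predictions just to show the relationship
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)        # harmonic mean of precision and recall
```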

Bagging Models

Decision Tree Classifier (model name = dtree, dtree_tuned, dtree_ccp)

Base Decision Tree

The base decision tree is overfitting the training data by a lot. Let's try hyperparameter tuning or we can try cost-complexity tuning!

Decision Tree Hypertuning (model name = dtree_tuned)

While the hypertuned model decreased training data scores, it definitely fit the testing data much better. The recall score is good and fits well, but there is a low value for precision. Precision is also important in this situation!

Let's try Cost Complexity Pruning

Decision Tree Cost Complexity Pruning (model name = dtree_ccp)

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree:

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
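The sklearn pattern described above, sketched on synthetic data (make_classification stands in for the tourism frame):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the tourism data, with a similar class imbalance
X, y = make_classification(n_samples=300, weights=[0.82], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Effective alphas and the corresponding total leaf impurities
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # drop the alpha that prunes the tree to a single node

# One tree per alpha; larger alphas give smaller trees
clfs = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
    for a in ccp_alphas
]
```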

We can choose alpha = 0.002, retaining information and getting higher recall. We can also compute the alpha value that gives the best validation score.

While the metrics in this model are better than the tuned model, this model overfits the training data- just like our original base decision tree model.

Let's move onto another classifier and see how it performs

Bagging Classifier (model name = bagging1, bagging1_tuned)


Base Bagging Classifier

The base bagging classifier does a good job on metrics but seems to be overfitting our training data. This can be seen especially with the recall values.

  • Compared to our tuned decision tree model, it does better on testing metrics but overfits more.

Let's try adding class weight to this classifier

Weighted Bagging Classifier

Similar to the base bagging classifier, we see overfitting on the training data with minuscule changes to the test data metrics.

  • This model does take into account class weights, so it is already better than the base model regardless

Let's hypertune the parameters

Bagging Classifier Hypertuning (model name = bagging1_tuned)

Some of the important hyperparameters available for bagging classifier are:

We are seeing similar results in our confusion matrix and metric scores as we saw in the weighted bagging classifier. There is also overfitting on the training data!

Let's move on to another model.

Random Forest Classifier (model name = rand_forest, rand_forest_tuned)

Base Random Forest Classifier

This base random forest model overfits the training data. The recall score shows this most apparently.

Let's get into hypertuning for this model!

Random Forest Hypertuning

Some of the important hyperparameters available for the random forest classifier are:

From this tuned model, we see that it doesn't overfit the training data as much as the base model

  • Our best model yet, in terms of not overfitting the data and metric scores
  • The tuned bagging classifier is next best, but it overfits the data more than this one with similar test data metrics
  • The tuned decision tree follows; its precision metrics aren't very high, but it fits the data well

Bagging Classifiers Overview and Feature Importances so far

With this model, the customer's Age is the most important factor in determining whether they will buy a product, followed by the Duration of Pitch. The customer's MonthlyIncome and having a Passport are also relatively significant!

Let's move on to another group of classifiers now

Boosting

AdaBoost Classifier (model name = ada_boost)

Base AdaBoost Classifier

The base AdaBoost classifier returns very low recall scores, so this is not a good model.

Let's hypertune the parameters!

AdaBoost Hypertuning

The tuned parameters helped with improving metric scores but it is now overfitting our training data.

  • Similar to the tuned random forest model, but the metric scores are a little lower and we see more overfitting on the train data

Moving onto Gradient Boosting!

Gradient Boosting Classifier (model name = grad_boost)

Base Gradient Boosting Classifier

The base gradient boosting classifier shows a better fit of the training data, but we see a low recall score. This will be a problem for the travel company, so we will need to work on this model.

Let's try to initialize it using AdaBoost and see how it does

Gradient Boosting Classifier initialized by AdaBoost

Initialized by the AdaBoost classifier, we saw a loss in the overall F1 score but a slightly better fit. There is still room for this model to improve.

Hypertuning time!

Gradient Boosting Classifier Hypertuning

The metrics improved, especially the F1 score, but the model has begun to overfit our training data again. It is very comparable to the tuned models and may be the best classifier we have seen yet.

Let's build a couple more models

XGBoost Classifier (model name = xgb_boost)

Base XGB Boosting Classifier

While the test data metrics look good, the training and testing metrics differ significantly, which means the model is overfitting the training data.

Hypertuning time!

XGB Boosting Classifier Hypertuning

While the metrics are comparable to the base XGBoost model, this fits the training data slightly better (but not by a lot). It is a good classifier balancing Precision and Recall, but we see the differences because it is still overfitting.

Comparing All Models

Classifier analysis

Stacking Classifier

Returned metrics comparable to what we saw in the comparison table earlier. The training data is still being overfit, but the metrics can be improved. Let's see if we can get better results with tuned estimators in the stacking classifier!

The test F1 score is relatively higher than what we saw on any other classifier. The training data is still being overfit, but this is the best model we created in terms of F1 score and relative fit.
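A minimal sketch of the stacking setup described here. The notebook combines a Bagging Classifier, AdaBoost, and Random Forest with XGBoost as the final estimator; LogisticRegression is substituted below so the sketch needs only scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the tourism training data
X, y = make_classification(n_samples=200, random_state=1)

# A layer of base estimators; the final estimator learns from their predictions
stack = StackingClassifier(
    estimators=[
        ("bagging", BaggingClassifier(random_state=1)),
        ("adaboost", AdaBoostClassifier(random_state=1)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
```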

Final Classifier Tuning and Analysis

An analysis of specifics within each test and training set. We can see how the training data was overfit, with very good scores. The precision metrics are lower, but the high recall score balances this out. We chose to focus on the F1 score because it balances precision and recall, and both scenarios are important for the travel company.

From the above graphic, we can see that the Area Under the Curve (AUC) is approximately 0.95. In general, an AUC of 1 represents a perfect classifier, while an AUC of 0.5 represents a worthless one.

From here, we can raise our threshold value to 0.6, but this only slightly improves precision while lowering recall.
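Threshold adjustment itself is simple, shown on hypothetical predicted probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities from a fitted classifier
proba = np.array([0.2, 0.55, 0.8, 0.45, 0.9])

# Default decision threshold is 0.5; raising it trades recall for precision
pred_default = (proba >= 0.5).astype(int)
pred_strict = (proba >= 0.6).astype(int)
```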

After setting the threshold value, we did not see further improvement in the metrics. There was a slight decrease in the overall F1 score, and there is more overfitting in this model.

Visualize Feature Importances

The stacking classifier can be a very complicated model to understand, so we will be showing the Marketing team and Policy makers some of the most important features we found throughout our modeling.

Looking at the estimators and their feature importances, we can compile features that are to be focused on in case there is no modeling available.
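A sketch of pulling importances from one of the estimators (synthetic data; the column names are borrowed from the dataset for illustration only):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; these feature names are illustrative, not the real fit
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
cols = ["Age", "DurationOfPitch", "MonthlyIncome", "Passport", "NumberOfFollowups"]

rf = RandomForestClassifier(random_state=1).fit(X, y)

# Impurity-based importances, largest first
importances = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
```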

Business understanding of the model

Purpose of this Project at Visit with Us:

Growing the customer base while keeping marketing costs efficient, and predicting which consumers are likely to purchase the product when targeted with a pitch. Currently, with conversion rates at 18%, the marketing department needs to identify which customers have a higher probability of purchasing the product while keeping expenditure to a minimum.

Objective:

Give the marketing department and policymaker information about which data features are the most significant and which segment of customers should be targeted more often. With given data features, the marketing department could also use a model which can accurately predict whether a customer will buy a product.

Focus:

The travel company wants to ensure that more people accept the product while also growing the customer base. The model should aim to reduce the number of customers falsely labeled as not taking on the product while also correctly labeling those who do. Thus, more analysis on which model metric should be emphasized will be done right before modeling.


The policy makers and marketing team wanted to increase the number of customers it has by introducing a new package product. This will help the travel company grow and get more customers to purchase a product when marketed to.

For this application, it was important that we didn't waste expenditure on those who would not buy a package, while trying not to miss any customers who might take one! Last year's conversion rate was healthy at 18%. While some customers may get missed when identifying those who will buy a package, the company does not want to advertise to someone who will not take a package, but also does not want to miss the opportunity to gain new customers. Because of this, the F1 score is the metric to focus on, as it relies on both precision and recall values.

The importance of each feature from the model will be used to classify the outcome. The goal is to build a model that identifies customers who are most likely to accept the package offer in future package product campaigns.

After a lot of data analysis and running through thousands of different possible model fits, a stacking classifier built using a layer of 3 different modeling techniques (Bagging Classifier, AdaBoost, and the Random Forest) combined with the XGBoost classifier returned the best results. The model will predict relatively well but there is some over-fitting of the data currently.

As our model expands and learns about new customers, it will be able to improve based on past learning errors. As of right now, there is a decent model for the marketing team to start targeting customers for package pitches!

There is data available for both the marketing teams and policy maker to help the business grow. By understanding the customer population and delivering good sales pitches, the travel company has a lot of opportunity to grow its customer base while keeping expenditure at a minimum.

For the policy maker: It is very crucial to have employees who know how to advertise and pitch the product. As we saw with customer interaction data, this makes a difference in how likely customers are to buy a product. As a policy maker, it is your responsibility to ensure there is good documentation and that employees are well trained in delivering a good pitch. Characteristics of a good pitch, as seen in our data analysis, are a good pitch satisfaction score, a good number of followups, and efficiently delivering good information in 14-15 minutes.

For more information, policy makers are recommended to read through the key points highlighted in the next section. There is more information available on good customer interaction and pitch characteristics in the next group of information that is compiled for the marketing team.

For the marketing team, there are some things they should always pay attention to:

The policy makers and marketing team need to work hand in hand to ensure customers receive quality pitches, as this can greatly influence whether a customer purchases a product. While PitchSatisfactionScore is essentially survey data, it still represents the value of having good skills in selling the package!

If the marketing teams needs to target customers as soon as possible, the travel company can utilize these features extracted as the most important features from modeling:

To grow the business as best as possible, it is advisable to also focus on some of the other key factors I have pointed out above. Being aware of certain data features can help to retain and grow the customer base.

Influencing customers with good interpersonal skills during the pitch and following up can also play a bigger factor than can be modeled.